Training Logistic Regression and SVM on 200GB Data Using b-Bit Minwise Hashing and Comparisons with Vowpal Wabbit (VW)

Authors

  • Ping Li
  • Anshumali Shrivastava
  • Arnd Christian König
Abstract

Our recent work on large-scale learning using b-bit minwise hashing [21, 22] was tested on the webspam dataset (about 24 GB in LibSVM format), which may be too small compared to real datasets used in industry. Since we could not access the proprietary dataset used in [31] for testing the Vowpal Wabbit (VW) hashing algorithm, in this paper we present an experimental study based on the expanded rcv1 dataset (about 200 GB in LibSVM format). In our earlier report [22], the experiments demonstrated that, with merely 200 hashed values per data point, b-bit minwise hashing can achieve test accuracies similar to VW with 10^6 hashed values per data point, on the webspam dataset. In this paper, our new experiments on the (expanded) rcv1 dataset clearly agree with our earlier observation that the b-bit minwise hashing algorithm is substantially more accurate than the VW hashing algorithm at the same storage. For example, with 2^14 (i.e., 16384) hashed values per data point, VW achieves test accuracies similar to b-bit minwise hashing with merely 30 hashed values per data point. This is not surprising, as the report [22] already demonstrated that the variance of the VW algorithm can be orders of magnitude larger than the variance of b-bit minwise hashing. It was also shown in [22] that VW has the same variance as random projections.
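The scheme being compared can be sketched in code. The following is a minimal illustrative implementation of b-bit minwise hashing feature expansion for a linear learner; the function names and parameter defaults are ours, not the paper's. Each data point (a set of binary features) is summarized by k minwise hash values, and only the lowest b bits of each value are kept and expanded into a sparse binary vector of dimension k·2^b, which can then be fed to a linear SVM or logistic regression.

```python
import random

def minwise_signatures(feature_set, k, seed=0):
    """Simulate k independent minwise hash functions: each is a salted
    32-bit hash, and the signature entry is the minimum over the set."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(k)]
    return [min(hash((salt, f)) & 0xFFFFFFFF for f in feature_set)
            for salt in salts]

def b_bit_features(feature_set, k=200, b=8, seed=0):
    """Keep only the lowest b bits of each of the k minwise hash values
    and expand them into the indices of the nonzero coordinates of a
    sparse binary vector of dimension k * 2**b."""
    mask = (1 << b) - 1
    return [i * (1 << b) + (h & mask)
            for i, h in enumerate(minwise_signatures(feature_set, k, seed))]
```

Two sets with Jaccard similarity J will agree on roughly k·(J + (1−J)/2^b) of the k nonzero coordinates, which is why a linear kernel over these expanded features approximates resemblance.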


Similar Papers

b-Bit Minwise Hashing for Large-Scale Learning

Abstract Minwise hashing is a standard technique in the context of search for efficiently computing set similarities. The recent development of b-bit minwise hashing provides a substantial improvement by storing only the lowest b bits of each hashed value. In this paper, we demonstrate that b-bit minwise hashing can be naturally integrated with linear learning algorithms such as linear SVM and ...

b-Bit Minwise Hashing for Large-Scale Linear SVM

Linear Support Vector Machines (e.g., SVM, Pegasos, LIBLINEAR) are powerful and extremely efficient classification tools when the datasets are very large and/or high-dimensional, which is common in (e.g.,) text classification. Minwise hashing is a popular technique in the context of search for computing resemblance similarity between ultra high-dimensional (e.g., 2^64) data vectors such as document...

Distributed Newton Methods for Regularized Logistic Regression

Regularized logistic regression is a very useful classification method, but for large-scale data, its distributed training has not been investigated much. In this work, we propose a distributed Newton method for training logistic regression. Many interesting techniques are discussed for reducing the communication cost and speeding up the computation. Experiments show that the proposed method is...

One Permutation Hashing

Abstract Minwise hashing is a standard procedure in the context of search, for efficiently estimating set similarities in massive binary data such as text. Recently, b-bit minwise hashing has been applied to large-scale learning and sublinear time near-neighbor search. The major drawback of minwise hashing is the expensive preprocessing, as the method requires applying (e.g.,) k = 200 to 500 per...

One Permutation Hashing for Efficient Search and Learning

Minwise hashing is a standard procedure in the context of search, for efficiently estimating set similarities in massive binary data such as text. Recently, the method of b-bit minwise hashing has been applied to large-scale linear learning (e.g., linear SVM or logistic regression) and sublinear time near-neighbor search. The major drawback of minwise hashing is the expensive preprocessing cost...
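The idea behind one permutation hashing can be sketched as follows; the names and layout here are our own simplification, not the paper's exact estimator. Instead of applying k independent permutations, the feature space is permuted once and split into k bins, and the smallest permuted value in each bin is recorded. Empty bins need special handling for unbiased similarity estimation, which this sketch omits.

```python
import random

def one_permutation_features(feature_set, D, k, seed=0):
    """Illustrative one permutation hashing: permute the feature space
    [0, D) once, split it into k equal bins, and keep the minimum
    offset within each bin (None marks an empty bin)."""
    rng = random.Random(seed)
    perm = list(range(D))
    rng.shuffle(perm)
    bin_size = D // k  # assume k divides D for simplicity
    mins = [None] * k
    for f in feature_set:
        j, off = divmod(perm[f], bin_size)
        if mins[j] is None or off < mins[j]:
            mins[j] = off
    return mins
```

A single pass over the data thus yields k sketch values, compared with k passes (one per permutation) for standard minwise hashing.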



Journal:
  • CoRR

Volume: abs/1108.3072

Publication date: 2011